A Tidy Framework and Infrastructure to Systematically Assemble Spatio-temporal Indexes from Multivariate Data

H. Sherry Zhang

JSM 2024 Portland, Oregon

Aug 7, 2024

Indexes

Sport climbing in the Olympic Games 🧗‍♀️

2020 Tokyo version

Boulder: 4m wall, 3 problems in final

Lead: 15m wall, 1 problem

Speed: 15m wall, always the same

Three disciplines, one gold medal

In Tokyo 2020, athletes are ranked from 1 to 8 (top - bottom) in each discipline. The final score is the multiplication of the ranks in each discipline.

Athletes Country Speed Boulder Lead Total Rank
Janja Garnbret Slovenia 5 1 1 5 1
Miho Nonaka Japan 3 3 5 45 2
Akiyo Noguchi Japan 4 4 4 64 3
Aleksandra Miroslaw Poland 1 8 8 64 4
. . . .. ..

Aleksandra Miroslaw gets 4th despite ranked last in both boulder and lead:

  • 0/3 in boulder problems (0T0Z)
  • scored 9/40 points in the lead (as compared to others: 13, 20, 21, 29, 34, 35, 37)

But she could win a medal if she performs better in the qualification round.

Hence this year, sport climbing has two medals

  • speed + boulder and lead combined
  • in boulder and lead combined, each discipline worth 100 points:
    • boulder: 25 points x 4 problems, partial points of 5 and 10 for zone 1 and zone2
    • lead: counting from the top, the last 10 moves - 4 points each, the previous 10 moves - 3 points each, … (4 x 10 + 3 x 10 + 2 x 10 + 10 = 100)

the game is on this week, but 1/4am PT 😭…

Inspired from tidymodel

A closer look at a class of drought indexes

The pipeline design (9 modules)

data with spatial (\(\mathbf{s}\)) and temporal (\(\mathbf{t}\)) dimensions: \[x_j(s;t)\]

  • Temporal processing: \(f[x_{sj}(t)]\)
  • Spatial processing: \(g[x_{tj}(s)]\)


  • Variable transformation: \(T[x_j(s;t)]\)
  • Scaling: \([x_j(s;t)- \alpha]/\gamma\)
  • Distribution fit: \(F[x_j(s;t)]\)
  • Normalising: \(\Phi^{-1}[x_j(s;t)]\)


  • Dimension reduction: \(h[\mathbf{x}(s;t)]\)
  • Benchmarking: \(u[x(s;t)]\)
  • Simplification
\[\begin{equation} \begin{cases} C_0 & c_1 \leq x(\mathbf{s};\mathbf{t}) < c_0 \\ C_1 & c_2 \leq x(\mathbf{s};\mathbf{t}) < c_1 \\ \cdots \\ C_z & c_z \leq x(\mathbf{s};\mathbf{t}) \end{cases} \end{equation}\]

Software design

DATA |>
  module1(...) |>
  module2(...) |>
  module3(...) |>
  ...

dimension_reduction(V1 = aggregate_linear(...))
dimension_reduction(V2 = aggregate_geometrical(...))
dimension_reduction(V3 = aggregate_manual(...))

The aggregate_*() function can be evaluated as a standalone recipe, before evaluated with the data in the dimension reduction module:

aggregate_manual(~x1 + x2)
[1] "aggregate_manual"
attr(,"formula")
[1] "x1 + x2"
attr(,"class")
[1] "dim_red"

Confidence interval in the SPI

A bootstrap sample of 100 is taken from the aggregated precipitation series to estimate gamma parameters and to calculate the index SPI for the Texas Post Office station in Queensland.

DATA %>%
  # aggregate monthly precipitation 
  # with a 24-month window
  aggregate(
    .var = prcp, .scale = 24
    ) %>%
  # fit a gamma distribution to 
  # obtain the probability value
  # [0, 1]
  dist_fit(
    .dist = gamma(), .var = .agg, 
    .n_boot = 100
    ) %>%
  # use the inverse CDF to 
  # convert into z-score
  augment(.var = .agg)

Confidence interval in the SPI

80% and 95% confidence interval of the Standardized Precipitation Index (SPI-24) for the Texas post office station, in Queensland, Australia. The dashed line at SPI = -2 represents an extreme drought as defined by the SPI. Most parts of the confidence intervals from 2019 to 2020 sit below the extreme drought line and are relatively wide compared to other time periods. This suggests that while it is certain that the Texas post office is suffering from a drastic drought, there is considerable uncertainty in quantifying its severity, given the extremity of the event.

Pipeline for two drought indexes

idx_spi <- function(.scale, .dist, ...){
  ...
  data %>%                            # data contain `prcp`
    aggregate(.var = prcp,            # step 1: temporal aggregation
              .scale = .scale)%>%     #         aggregate `prcp` with time scale    
                                      #         `.scale` to create `.agg`, by default
    dist_fit(.dist = .dist,           # step 2: distribution fit
             .method = "lmoms",       #         using L-moment to fit `.dist`
             .var = .agg) %>%         #         distribution on `.agg`
    augment(.var = .agg)              # step 3: normalising 
                                      #         find the normal density for `.agg`
}
idx_spei <- function(.scale, .dist, ...){
  ...
  data %>%                                  # data contain `tavg` and `prcp`
    var_trans(                              # step 1: variable transformation
      .method = "thornthwaite",             #         using the thornthwaite function
      .vars = tavg, .new_name = "pet") %>%  #         on `tavg` to create `pet`
    dim_red(diff = prcp - pet) %>%          # step 2: dimension reduction 
    aggregate(                              # step 3: temporal aggregation
      .var = diff,                          #         aggregate `diff` with time scale
      .scale = .scale) %>%                  #         `.scale` to create `.agg`
    dist_fit(                               # step 4: distribution fit
      .dist = .dist, .method = "lmoms",     #         using L-moment to fit `.dist`
      .var = .agg) %>%                      #         distribution on `.agg`
    augment(.var = .agg)                    # step 5: normalising 
                                            #         find the normal density for `.agg`
}

Example

.scale <- c(6, 12, 24, 36)
(idx <- queensland %>%
  mutate(month = lubridate::month(ym)) |>
  init(id = id, time = ym, group = month) |> 
  compute_indexes(
    spei = idx_spei(
      .tavg = tavg, .lat = lat, 
      .scale = .scale, .dist = list(dist_gev(), dist_glo())),
    spi = idx_spi(.scale = .scale)
  ))
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
[1] "Checking for missing values (`NA`): all the data must be complete. Input type is vector. Assuming the data are monthly time series starting in January, all regular (non-leap) years."
# A tibble: 128,576 × 18
   .idx  .dist id       month       ym  prcp  tmax  tmin  tavg  long   lat name 
   <chr> <chr> <chr>    <dbl>    <mth> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
 1 spei  gev   ASN0002…     6 1990 Jun   170  29.7  16.2  23.0  142. -15.5 KOWA…
 2 spei  gev   ASN0002…     7 1990 Jul   102  31.2  17.2  24.2  142. -15.5 KOWA…
 3 spei  gev   ASN0002…     8 1990 Aug     0  31.3  13.1  22.2  142. -15.5 KOWA…
 4 spei  gev   ASN0002…     9 1990 Sep     0  32.8  16.3  24.5  142. -15.5 KOWA…
 5 spei  gev   ASN0002…    10 1990 Oct     0  36.8  21.5  29.2  142. -15.5 KOWA…
 6 spei  gev   ASN0002…    11 1990 Nov   278  36.3  24.8  30.6  142. -15.5 KOWA…
 7 spei  gev   ASN0002…    12 1990 Dec  1869  34.4  24.5  29.4  142. -15.5 KOWA…
 8 spei  gev   ASN0002…    12 1990 Dec  1869  34.4  24.5  29.4  142. -15.5 KOWA…
 9 spei  gev   ASN0002…     1 1991 Jan  5088  31.2  24.4  27.8  142. -15.5 KOWA…
10 spei  gev   ASN0002…     1 1991 Jan  5088  31.2  24.4  27.8  142. -15.5 KOWA…
# ℹ 128,566 more rows
# ℹ 6 more variables: .pet <dbl>, .diff <dbl>, .scale <chr>, .agg <dbl>,
#   .fit <dbl>, .index <dbl>

2010 Queensland flood & 2019-20 Australia drought

All time scales agree on an extreme drought in 2019-20 bushfire season

🔗